Problem Statement
Customer Personality Analysis is a detailed analysis of a company's ideal customers. It helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors and concerns of different customer types.
Customer personality analysis helps a business adapt its product to target customers from different customer segments. For example, instead of spending money marketing a new product to every customer in the company's database, the company can identify which customer segment is most likely to buy the product and then market the product only to that particular segment.
Attributes
People
Products
Promotion
Place
Target
!pip install sklearn_pandas
!pip install --user yellowbrick
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn_pandas import DataFrameMapper, gen_features
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_theme()
PALETTE = sns.color_palette("Set2")
sns.set_context("paper", rc={"font.size":12,
"figure.titlesize":18,
"axes.titlesize":15,
"axes.labelsize":13,
"xtick.labelsize": 13,
"ytick.labelsize": 13,
"legend.fontsize": 9,
"legend.title_fontsize": 11})
EDA
data = pd.read_csv('archive/marketing_campaign.csv', sep='\t',
index_col='ID',
parse_dates=['Dt_Customer'])
data.head(10)
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | 88 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-08-03 | 38 | 11 | 1 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-10-02 | 26 | 11 | 4 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 7446 | 1967 | Master | Together | 62513.0 | 0 | 1 | 2013-09-09 | 16 | 520 | 42 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 965 | 1971 | Graduation | Divorced | 55635.0 | 0 | 1 | 2012-11-13 | 34 | 235 | 65 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6177 | 1985 | PhD | Married | 33454.0 | 1 | 0 | 2013-08-05 | 32 | 76 | 10 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4855 | 1974 | PhD | Together | 30351.0 | 1 | 0 | 2013-06-06 | 19 | 14 | 0 | ... | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 5899 | 1950 | PhD | Together | 5648.0 | 1 | 1 | 2014-03-13 | 68 | 28 | 0 | ... | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
10 rows × 28 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2240 entries, 5524 to 9405
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Year_Birth           2240 non-null   int64
 1   Education            2240 non-null   object
 2   Marital_Status       2240 non-null   object
 3   Income               2216 non-null   float64
 4   Kidhome              2240 non-null   int64
 5   Teenhome             2240 non-null   int64
 6   Dt_Customer          2240 non-null   datetime64[ns]
 7   Recency              2240 non-null   int64
 8   MntWines             2240 non-null   int64
 9   MntFruits            2240 non-null   int64
 10  MntMeatProducts      2240 non-null   int64
 11  MntFishProducts      2240 non-null   int64
 12  MntSweetProducts     2240 non-null   int64
 13  MntGoldProds         2240 non-null   int64
 14  NumDealsPurchases    2240 non-null   int64
 15  NumWebPurchases      2240 non-null   int64
 16  NumCatalogPurchases  2240 non-null   int64
 17  NumStorePurchases    2240 non-null   int64
 18  NumWebVisitsMonth    2240 non-null   int64
 19  AcceptedCmp3         2240 non-null   int64
 20  AcceptedCmp4         2240 non-null   int64
 21  AcceptedCmp5         2240 non-null   int64
 22  AcceptedCmp1         2240 non-null   int64
 23  AcceptedCmp2         2240 non-null   int64
 24  Complain             2240 non-null   int64
 25  Z_CostContact        2240 non-null   int64
 26  Z_Revenue            2240 non-null   int64
 27  Response             2240 non-null   int64
dtypes: datetime64[ns](1), float64(1), int64(24), object(2)
memory usage: 507.5+ KB
There are two columns that are not mentioned in the dataset description: Z_CostContact and Z_Revenue
data['Z_Revenue'].value_counts()
11 2240 Name: Z_Revenue, dtype: int64
data['Z_CostContact'].value_counts()
3 2240 Name: Z_CostContact, dtype: int64
Since both columns contain only a single constant value, they can be safely dropped
data.drop(columns=['Z_Revenue', 'Z_CostContact'], inplace=True)
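More generally, constant columns can be found programmatically instead of inspecting `value_counts()` one by one; a minimal sketch on a toy frame (the columns here only illustrate the idea):

```python
import pandas as pd

# Toy frame standing in for `data`: one informative column, one constant column
df = pd.DataFrame({'Income': [1.0, 2.0, 3.0], 'Z_Revenue': [11, 11, 11]})

# Columns with a single unique value carry no information for segmentation
constant_cols = df.columns[df.nunique() == 1].tolist()
df = df.drop(columns=constant_cols)
```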
data[data.duplicated(keep=False)].sort_values(by='Income')
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 6864 | 1989 | Master | Divorced | 10979.0 | 0 | 0 | 2014-05-22 | 34 | 8 | 4 | ... | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10617 | 1989 | Master | Divorced | 10979.0 | 0 | 0 | 2014-05-22 | 34 | 8 | 4 | ... | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 234 | 1979 | Graduation | Divorced | 15287.0 | 1 | 0 | 2012-10-10 | 60 | 1 | 2 | ... | 1 | 2 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10264 | 1979 | Graduation | Divorced | 15287.0 | 1 | 0 | 2012-10-10 | 60 | 1 | 2 | ... | 1 | 2 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8420 | 1970 | 2n Cycle | Married | 15315.0 | 0 | 0 | 2013-03-08 | 27 | 7 | 4 | ... | 0 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5536 | 1959 | Graduation | Together | 87771.0 | 0 | 1 | 2013-05-22 | 61 | 1492 | 38 | ... | 10 | 4 | 6 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 6024 | 1953 | Graduation | Together | 94384.0 | 0 | 0 | 2013-04-03 | 62 | 1111 | 24 | ... | 8 | 5 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 5386 | 1953 | Graduation | Together | 94384.0 | 0 | 0 | 2013-04-03 | 62 | 1111 | 24 | ... | 8 | 5 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4248 | 1960 | Master | Single | 98777.0 | 0 | 0 | 2014-02-17 | 23 | 1000 | 19 | ... | 6 | 9 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7451 | 1960 | Master | Single | 98777.0 | 0 | 0 | 2014-02-17 | 23 | 1000 | 19 | ... | 6 | 9 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
358 rows × 26 columns
data[data.duplicated(keep='first')]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 4119 | 1963 | Master | Together | 38620.0 | 0 | 0 | 2013-11-05 | 56 | 112 | 17 | ... | 5 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6935 | 1951 | 2n Cycle | Married | 78497.0 | 0 | 0 | 2013-01-12 | 44 | 207 | 26 | ... | 7 | 12 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10144 | 1976 | Graduation | Together | 51369.0 | 0 | 1 | 2012-10-25 | 84 | 297 | 7 | ... | 2 | 4 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8652 | 1946 | Graduation | Together | 37760.0 | 0 | 0 | 2012-08-31 | 20 | 84 | 5 | ... | 1 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 933 | 1978 | PhD | Married | 37717.0 | 1 | 0 | 2012-11-23 | 31 | 9 | 0 | ... | 0 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4838 | 1978 | Graduation | Together | 22775.0 | 1 | 0 | 2013-06-19 | 40 | 5 | 1 | ... | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7366 | 1982 | Master | Single | 75777.0 | 0 | 0 | 2013-04-07 | 12 | 712 | 26 | ... | 6 | 11 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4749 | 1968 | Graduation | Together | 58554.0 | 1 | 1 | 2012-09-26 | 55 | 368 | 24 | ... | 2 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8372 | 1974 | Graduation | Married | 34421.0 | 1 | 0 | 2013-01-07 | 81 | 3 | 3 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 2014-10-06 | 56 | 406 | 0 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
182 rows × 26 columns
`duplicated(keep=False)` flags 358 rows, while `keep='first'` flags only 182 extra copies, so there are 358 - 182 = 176 unique duplicated rows. Since 176 × 2 < 358, some rows must occur 3 or more times. I doubt these duplicates are valid: it is hard to believe that two customers would match exactly on the amounts spent on every product type, the numbers of purchases, the website visits in the last month and every other feature. Without the dataset author I can't be certain either way, but I will drop them
data.drop_duplicates(inplace=True)
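The difference between `keep=False` (flags every copy) and `keep='first'` (flags only the copies beyond the first) explains the 358 vs 182 counts; a toy illustration:

```python
import pandas as pd

# Toy frame: the value 1 appears three times, 3 appears twice
df = pd.DataFrame({'x': [1, 1, 1, 2, 3, 3]})

flagged_all = df.duplicated(keep=False).sum()      # every row that has a twin
flagged_extra = df.duplicated(keep='first').sum()  # only the copies beyond the first
unique_duplicated = flagged_all - flagged_extra    # distinct rows occurring more than once
```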
The dataset creator doesn't state in which year the data was collected.
So let's assume the dataset was collected on the day after the last customer enrollment, plus 2 years, because most of the features are aggregated over the last 2 years.
print('The last day a client was enrolled is ', data['Dt_Customer'].dt.date.max())
The last day a client was enrolled is 2014-12-06
There are two time-related features: Year_Birth and Dt_Customer. I will transform Year_Birth into an Age feature by subtracting the year of birth from 2016. It is also convenient to transform the datetime feature Dt_Customer into an integer CustomerFor: the number of days since the customer enrolled
data.insert(1, 'Age', 2016 - data['Year_Birth'])
data.insert(2, 'CustomerFor', (np.datetime64('2016-12-07') - data['Dt_Customer']).dt.days)
data.drop(columns=['Dt_Customer', 'Year_Birth'], inplace=True)
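The hard-coded '2016-12-07' can also be derived from the data itself, so the assumption (last enrollment + 2 years + 1 day) stays correct if the dataset changes; a sketch on two illustrative enrollment dates, the latest of which matches the one in the data:

```python
import pandas as pd

# Illustrative enrollment dates
dt_customer = pd.Series(pd.to_datetime(['2012-04-09', '2014-12-06']))

# Assumed collection date: the day after the last enrollment, plus 2 years
reference_date = dt_customer.max() + pd.DateOffset(years=2) + pd.Timedelta(days=1)
customer_for = (reference_date - dt_customer).dt.days
```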
Age distribution
plt.figure(figsize=(24, 6))
plt.title('Clients age distribution')
ax = sns.histplot(data['Age'].sort_values(), bins=26)
sns.rugplot(data=data['Age'], height=.05)
plt.xticks(np.linspace(data['Age'].min(), data['Age'].max(), 26, dtype=int, endpoint = True))
plt.show()
There are some improbably old customers ._.
Most likely the year of birth was mistyped during data entry. Everything else in these rows looks okay, so I will exclude them when training the model but keep them in the dataset for prediction
data.query('Age > 110')
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 7829 | 116 | 1168 | 2n Cycle | Divorced | 36640.0 | 1 | 0 | 99 | 15 | 6 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 11004 | 123 | 935 | 2n Cycle | Single | 60182.0 | 0 | 1 | 23 | 8 | 0 | ... | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1150 | 117 | 1168 | PhD | Together | 83532.0 | 0 | 0 | 36 | 755 | 144 | ... | 6 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 rows × 26 columns
plt.figure(figsize=(16,5))
plt.title(f'Customers income boxplot')
ax = sns.boxplot(data['Income'], palette=PALETTE)
plt.show()
There are some outliers. Let's look at them.
data.query('Income > 140000')
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 8475 | 43 | 1069 | PhD | Married | 157243.0 | 0 | 1 | 98 | 20 | 2 | ... | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1503 | 40 | 1372 | PhD | Together | 162397.0 | 1 | 1 | 31 | 85 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5555 | 41 | 889 | Graduation | Divorced | 153924.0 | 0 | 0 | 81 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1501 | 34 | 1704 | PhD | Married | 160803.0 | 0 | 0 | 21 | 55 | 16 | ... | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5336 | 45 | 1341 | Master | Together | 157733.0 | 1 | 0 | 37 | 39 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4931 | 39 | 1318 | Graduation | Together | 157146.0 | 0 | 0 | 13 | 1 | 0 | ... | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11181 | 67 | 1196 | PhD | Married | 156924.0 | 0 | 0 | 85 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9432 | 39 | 1400 | Graduation | Together | 666666.0 | 1 | 0 | 23 | 9 | 14 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 rows × 26 columns
There is no reason to consider these rows invalid, but outliers can negatively affect clustering methods, so I'll also exclude them during training
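One way to keep such rows for prediction while excluding them from model fitting is a boolean mask; a sketch using the thresholds discussed above (110 years, 140,000 income — both judgment calls, and the toy rows below only mirror the anomalies seen in the tables):

```python
import pandas as pd

# Toy rows mirroring the anomalies (a 116-year-old, a 666666 income)
df = pd.DataFrame({'Age': [45, 116, 39],
                   'Income': [58138.0, 36640.0, 666666.0]})

# Fit the model only on plausible rows, but keep everything for prediction
train_mask = (df['Age'] <= 110) & (df['Income'] <= 140000)
train_df = df[train_mask]
```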
plt.figure(figsize=(24, 6))
plt.title('Customers yearly household income distribution')
ax = sns.histplot(data.query('Income < 150000')['Income'], kde=True)  # sns.distplot is deprecated in newer seaborn
sns.rugplot(data=data.query('Income < 150000')['Income'], height=.05)
plt.show()
Filling nulls
print(f'There are {data["Income"].isna().sum()} missing Income values')
There are 24 missing Income values
Let's impute these NaNs with the median
data['Income'].fillna(data['Income'].median(), inplace=True)
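A possible refinement (not applied here) is to impute the median per education group, since income clearly varies with degree; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame: one missing Income in each education group
df = pd.DataFrame({'Education': ['PhD', 'PhD', 'Basic', 'Basic'],
                   'Income': [80000.0, None, 20000.0, None]})

# Fill each gap with the median of its own education group
df['Income'] = df.groupby('Education')['Income'].transform(lambda s: s.fillna(s.median()))
```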
data['Education'].value_counts()
Graduation 1030 PhD 450 Master 341 2n Cycle 188 Basic 49 Name: Education, dtype: int64
The values Graduation and 2n Cycle are ambiguous.
All countries conveyed their national systems to a two cycle structure consisting of a first (undergraduate) and a second (graduate) cycle. Source: EHEA
According to the three-cycle system of the European Higher Education Area, 2n Cycle refers to a Master degree. And Graduation means the person is in the second (graduate) cycle, so they have in fact finished the first (undergraduate) cycle, known in many countries as a Bachelor degree
So the changes are as follows:
2n Cycle -> Master
Graduation -> Bachelor
data['Education'].replace(['2n Cycle', 'Graduation'],
['Master', 'Bachelor'], inplace=True)
sizes = dict(data['Education'].value_counts())
plt.figure(figsize=(12, 8))
plt.title("Education degrees proportion")
plt.pie(sizes.values(), labels=sizes.keys(), autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.legend(title="Client's education", labels=sizes.keys(), bbox_to_anchor=(1, 1))
# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()
Let's see how the Income varies across different education degrees
plt.figure(figsize=(18, 6))
sns.boxplot(data=data.query('Income < 500000'), x='Education', y='Income', palette=PALETTE, showfliers=False,)
plt.title('Customers income by education degree')
plt.show()
Income for the Bachelor, PhD and Master degrees is around the same, but Basic degree income is definitely lower than the others
data['Marital_Status'].value_counts()
Married 795 Together 521 Single 448 Divorced 216 Widow 71 Alone 3 YOLO 2 Absurd 2 Name: Marital_Status, dtype: int64
We can merge Alone into the Single category, but YOLO and Absurd are unclear.
YOLO
YOLO (You Only Live Once) refers to a lifestyle trend adopted by many young people as a way to better enjoy life, without worrying about saving up for the future.
I can assume the YOLO category refers to people who do not have a permanent partner, so I will merge it into Single
Absurd
In philosophy, "the Absurd" refers to the conflict between the human tendency to seek inherent value and meaning in life, and the human inability to find these with any certainty.
I would also merge Absurd into Single
data['Marital_Status'].replace(['YOLO', 'Absurd', 'Alone'], 'Single', inplace=True)
Now let's look at the proportions of marital statuses
sizes = dict(data['Marital_Status'].value_counts())
plt.figure(figsize=(12, 8))
plt.title("Marital statuses proportion")
plt.pie(sizes.values(), labels=sizes.keys(), autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.legend(title="Client's marital status", labels=sizes.keys(), bbox_to_anchor=(1, 1))
# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()
I'll combine the statuses [Single, Widow, Divorced] and [Together, Married], because a client, as a consumer, is described better by the presence of a partner than by a specific status
data['HasPartner'] = data["Marital_Status"].replace({'Single': 'No',
'Widow': 'No',
'Divorced': 'No',
'Together': 'Yes',
'Married': 'Yes'})
sizes = dict(data['HasPartner'].value_counts())
plt.figure(figsize=(12, 8))
plt.title("Does the customer have a partner")
plt.pie(sizes.values(), labels=sizes.keys(), autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.show()
So most customers are in a relationship
data['Kidhome'].value_counts()
0 1184 1 831 2 43 Name: Kidhome, dtype: int64
data['Teenhome'].value_counts()
0 1060 1 948 2 50 Name: Teenhome, dtype: int64
Let's capture all children in the household with a NumChildren feature
I'll also introduce a HasChildren feature, which equals 1 if the customer has one or more children and 0 otherwise
data['NumChildren'] = data['Kidhome'] + data['Teenhome']
data['HasChildren'] = (data['NumChildren'] >= 1).astype('int64')
sizes = dict(data['NumChildren'].value_counts())
plt.figure(figsize=(12, 8))
plt.title("Number of children in customers households")
plt.pie(sizes.values(), labels=sizes.keys(), autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.legend(title="Number of children", labels=sizes.keys(), bbox_to_anchor=(1, 1))
# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()
We see that most customers have 1 child
plt.figure(figsize=(18, 6))
ax = sns.boxplot(data=data.query('Income < 500000'), x='HasChildren', y='Income', palette=PALETTE, showfliers=False)
plt.title('Customers income depending on having children')
ax.set_xticklabels(['No', 'Yes'])
ax.set_xlabel('Does the customer have children')
plt.show()
Customers without children have a higher income
plt.figure(figsize=(18, 6))
ax = sns.boxplot(data=data.query('Income < 500000'), x='HasPartner', y='Income', palette=PALETTE, showfliers=False)
plt.title('Customers income depending on having a partner')
ax.set_xticklabels(['No', 'Yes'])
ax.set_xlabel('Does the customer have a partner')
plt.show()
mnt = data.filter(like='Mnt').sum(axis=0)
sizes = dict(mnt)
plt.figure(figsize=(12, 8))
plt.title("Amount spent on different types of products")
plt.pie(sizes.values(), labels=['Wine', 'Fruits', 'Meat',
'Fish', 'Sweets', 'Gold'], autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.legend(title="Product type", labels=['Wine', 'Fruits', 'Meat',
'Fish', 'Sweets', 'Gold'], bbox_to_anchor=(1, 1))
# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()
Wine and meat products account for most of the spending
Let's introduce an MntTotal feature: the total amount a customer spent in the last 2 years
data['MntTotal'] = data.filter(like='Mnt').sum(axis=1)
And I will calculate, for each customer, the percentage of the total amount spent on each product type
data = data.assign(
percentWines=lambda x: x['MntWines'] / x['MntTotal'] * 100,
percentMeat=lambda x: x['MntMeatProducts'] / x['MntTotal'] * 100,
percentFruits=lambda x: x['MntFruits'] / x['MntTotal'] * 100,
percentFish=lambda x: x['MntFishProducts'] / x['MntTotal'] * 100,
percentSweets=lambda x: x['MntSweetProducts'] / x['MntTotal'] * 100,
percentGold=lambda x: x['MntGoldProds'] / x['MntTotal'] * 100,
)
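The same columns can be computed in one vectorized step with `DataFrame.div` instead of six separate lambdas; a sketch on a toy frame with only two product columns:

```python
import pandas as pd

# Toy spending columns
df = pd.DataFrame({'MntWines': [50, 0], 'MntFruits': [50, 100]})
df['MntTotal'] = df[['MntWines', 'MntFruits']].sum(axis=1)

# Divide every product column by the row total in a single operation
percents = df[['MntWines', 'MntFruits']].div(df['MntTotal'], axis=0) * 100
```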
Let's analyze these features in terms of education
fig, axes = plt.subplots(4, 6, figsize=(16, 15), sharey=True)
fig.suptitle('Percent of amount spent on each product type from total amount spent by education', fontsize=20)
products = [('percentWines', 'Wine'), ('percentMeat', 'Meat'), ('percentFruits', 'Fruits'),
('percentFish', 'Fish'), ('percentSweets', 'Sweets'), ('percentGold', 'Gold')]
for i, value in enumerate(data['Education'].unique()):
    for j, (col, label) in enumerate(products):
        sns.boxplot(data=data.query(f'Education == "{value}"'), y=col, showfliers=False,
                    color=PALETTE[i], ax=axes[i, j])
        axes[i, j].set_xlabel(label)
        axes[i, j].set_ylabel('')
    axes[i, 0].set_ylim(0, 100)
    axes[i, 2].set_title(f'{value}', x=1)
plt.tight_layout()
Customers with PhD, Bachelor and Master degrees mostly spend on wine and meat products. Note also that more than half of PhD customers' spending goes to wine products (by median)
Basic degree customers spend relatively more on gold, fish and sweets
Now let's check the total amount spent depending on education degrees and number of children
plt.figure(figsize=(18, 6))
ax = sns.boxplot(data=data, x='Education', y='MntTotal', palette=PALETTE)
plt.title('Customers amount spent depending on an education degree')
plt.show()
Customers with a PhD degree spent the most in the last 2 years and customers with a Basic degree the least, which matches the income distribution
plt.figure(figsize=(18, 6))
ax = sns.boxplot(data=data, x='HasChildren', y='MntTotal', palette=PALETTE)
plt.title('Customers amount spent depending on having children')
ax.set_xticklabels(['No', 'Yes'])
ax.set_xlabel('Does the customer have children')
plt.show()
And the amount spent by parents and non-parents differs a lot
num = data.filter(regex='Num(Web|Catalog|Store)Purchases').sum(axis=0)  # exclude NumDealsPurchases
sizes = dict(num)
plt.figure(figsize=(12, 8))
plt.title("Shopping types proportions")
plt.pie(sizes.values(), labels=['Website', 'Catalog', 'Store'], autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.legend(title="Purchased at", labels=['Website', 'Catalog', 'Store'], bbox_to_anchor=(1, 1))
plt.show()
Most purchases are made in store
There is an interesting insight: customers with a Basic education degree have more website visits per month than others:
fig, axes= plt.subplots(2, 1, figsize=(18, 13))
sns.boxplot(data=data, x='Education', y='NumWebVisitsMonth', palette=PALETTE, showfliers=False, ax=axes[0])
axes[0].set_title('Number of website visits per month depending on education degree')
axes[0].set_xlabel('')
sns.boxplot(data=data, x='Education', y='NumWebPurchases', palette=PALETTE, showfliers=False, ax=axes[1])
axes[1].set_title('Number of web purchases depending on education degree')
axes[1].set_xlabel('')
plt.show()
Despite that, they do not buy more on the website; they might just be monitoring the deals. So the store could publish more deals on the site to encourage them to buy more
The same holds for customers who have children:
fig, axes= plt.subplots(2, 1, figsize=(18, 12))
sns.boxplot(data=data, x='HasChildren', y='NumWebVisitsMonth', palette=PALETTE, showfliers=False, ax=axes[0])
axes[0].set_title('Number of website visits per month depending on having children')
axes[0].set_xlabel('')
axes[0].set_xticklabels(['No children', 'One or more children'])
sns.boxplot(data=data, x='HasChildren', y='NumWebPurchases', palette=PALETTE, showfliers=False, ax=axes[1])
axes[1].set_title('Number of web purchases depending on having children')
axes[1].set_xlabel('')
axes[1].set_xticklabels(['No children', 'One or more children'])
plt.show()
data['NumTotalPurchases'] = data.filter(regex='Num(Web|Catalog|Store)Purchases').sum(axis=1)
The NumTotalPurchases feature is the sum of all web, catalog and store purchases made by a customer (deal purchases overlap with these channels, so they are excluded)
data.loc[(data['NumTotalPurchases'] == 0) & (data['MntTotal'] != 0)]
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | NumChildren | HasChildren | MntTotal | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | NumTotalPurchases | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 5555 | 41 | 889 | Bachelor | Divorced | 153924.0 | 0 | 0 | 81 | 1 | 1 | ... | 0 | 0 | 6 | 16.666667 | 16.666667 | 16.666667 | 16.666667 | 16.666667 | 16.666667 | 0 |
| 3955 | 51 | 899 | Bachelor | Divorced | 4861.0 | 0 | 0 | 20 | 2 | 1 | ... | 0 | 0 | 6 | 33.333333 | 16.666667 | 16.666667 | 16.666667 | 0.000000 | 16.666667 | 0 |
| 6862 | 45 | 934 | Bachelor | Divorced | 1730.0 | 0 | 0 | 65 | 1 | 1 | ... | 0 | 0 | 8 | 12.500000 | 37.500000 | 12.500000 | 12.500000 | 12.500000 | 12.500000 | 0 |
| 11110 | 43 | 1334 | Bachelor | Single | 3502.0 | 1 | 0 | 56 | 2 | 1 | ... | 1 | 1 | 5 | 40.000000 | 20.000000 | 20.000000 | 0.000000 | 0.000000 | 20.000000 | 0 |
| 9931 | 53 | 898 | PhD | Married | 4023.0 | 1 | 1 | 29 | 5 | 0 | ... | 2 | 1 | 9 | 55.555556 | 11.111111 | 0.000000 | 11.111111 | 11.111111 | 11.111111 | 0 |
| 11181 | 67 | 1196 | PhD | Married | 156924.0 | 0 | 0 | 85 | 2 | 1 | ... | 0 | 0 | 8 | 25.000000 | 25.000000 | 12.500000 | 12.500000 | 12.500000 | 12.500000 | 0 |
6 rows × 37 columns
There are 6 customers with 0 purchases but a non-zero total amount spent. This looks like incorrectly collected data, so let's delete these rows.
data.drop(data.loc[(data['NumTotalPurchases'] == 0) & (data['MntTotal'] != 0)].index, inplace=True)
Let's look at the correlation between CustomerFor and NumTotalPurchases
plt.figure(figsize=(16, 6))
plt.title('Customer for vs number of purchases')
sns.scatterplot(data=data, x='CustomerFor', y='NumTotalPurchases', alpha=0.5)
plt.xlabel('Customer for (days)')
plt.ylabel('Total number of purchases')
plt.show()
There seems to be no correlation between CustomerFor and NumTotalPurchases. So I assume the purchase counts are also aggregated over the last 2 years, and we can evaluate a customer's activity with NumTotalPurchases
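The visual impression from the scatterplot can be checked numerically with the Pearson coefficient via `Series.corr`; a sketch on synthetic independent data (the real check would use the two actual columns):

```python
import numpy as np
import pandas as pd

# Synthetic independent stand-ins for the two columns
rng = np.random.default_rng(42)
customer_for = pd.Series(rng.integers(0, 1500, size=1000))
purchases = pd.Series(rng.integers(0, 40, size=1000))

# Pearson correlation is close to 0 for independent features
r = customer_for.corr(purchases)
```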
AvgCheck is the customer's average amount spent per purchase
data['AvgCheck'] = data['MntTotal'] / data['NumTotalPurchases']
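If any remaining customers have zero purchases and zero spend, this division produces NaN (0/0). A guarded variant could treat "no purchases" as an average check of 0; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: the second customer made no purchases and spent nothing
df = pd.DataFrame({'MntTotal': [1679, 0], 'NumTotalPurchases': [1, 0]})

# 0/0 yields NaN and x/0 yields inf; map both to an average check of 0
avg = df['MntTotal'] / df['NumTotalPurchases']
df['AvgCheck'] = avg.replace([np.inf, -np.inf], np.nan).fillna(0)
```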
plt.figure(figsize=(16,5))
plt.title('Customers average check boxplot')
ax = sns.boxplot(data.query('AvgCheck < 1600')['AvgCheck'], palette=PALETTE)
plt.xticks(np.linspace(0, 250, 11, dtype=int))
plt.show()
There is also one outlier with an average check of 1679
data.query('AvgCheck > 1500')
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | HasChildren | MntTotal | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | NumTotalPurchases | AvgCheck | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 8720 | 38 | 1460 | Master | Together | 51533.0 | 0 | 0 | 53 | 32 | 2 | ... | 0 | 1679 | 1.905896 | 95.711733 | 0.119119 | 0.714711 | 0.238237 | 1.310304 | 1 | 1679.0 |
1 rows × 38 columns
Let's add an AcceptedTotal feature: the number of campaigns the customer accepted (the five AcceptedCmp columns plus Response, the last campaign)
data['AcceptedTotal'] = data.filter(regex='Accepted|Response').sum(axis=1)
plt.figure(figsize=(18, 6))
ax = sns.countplot(data['AcceptedTotal'], palette=PALETTE)
plt.title('Number of accepted campaigns by customers')
plt.xlabel('Accepted total')
# annotating the bars
for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x()+0.30, p.get_height()+3))
The majority of customers didn't accept any campaign
total = data.groupby('AcceptedTotal').sum().reset_index()
total
| AcceptedTotal | Age | CustomerFor | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | HasChildren | MntTotal | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | NumTotalPurchases | AvgCheck | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 69754 | 1829783 | 72417505.0 | 750 | 820 | 75330 | 321642 | 33514 | 188040 | ... | 1174 | 681390 | 65219.039141 | 37163.249189 | 7947.485706 | 11519.884753 | 8103.157796 | 18347.183415 | 16824 | 45092.127532 |
| 1 | 1 | 16893 | 439732 | 19584933.0 | 122 | 164 | 15791 | 148182 | 10814 | 80399 | ... | 220 | 285270 | 17288.823648 | 8563.345590 | 1312.328686 | 1907.139665 | 1501.385281 | 4126.977131 | 5089 | 18285.824019 |
| 2 | 2 | 6278 | 170055 | 8255826.0 | 38 | 45 | 5328 | 80421 | 4840 | 39290 | ... | 66 | 143835 | 6923.459559 | 3232.643823 | 529.176335 | 612.829615 | 471.188162 | 1530.702506 | 2102 | 8152.279449 |
| 3 | 3 | 2069 | 61138 | 3673928.0 | 3 | 13 | 2047 | 37472 | 2558 | 18206 | ... | 13 | 69530 | 2588.213200 | 1174.937038 | 164.250368 | 275.622428 | 171.170592 | 325.806374 | 906 | 3838.849095 |
| 4 | 4 | 1446 | 40620 | 2698811.0 | 2 | 4 | 1526 | 29765 | 1898 | 15857 | ... | 5 | 54393 | 1789.893794 | 966.999260 | 109.180101 | 116.017155 | 136.356585 | 181.553106 | 682 | 2720.071416 |
| 5 | 5 | 424 | 10957 | 776573.0 | 0 | 1 | 385 | 7976 | 362 | 3335 | ... | 1 | 13183 | 500.089620 | 238.869836 | 29.314826 | 38.708212 | 43.445296 | 49.572211 | 184 | 691.765987 |
6 rows × 36 columns
plt.figure(figsize=(10, 10))
bottom = pd.Series(0.0, index=range(6))
for i, value in enumerate(data['Education'].unique()):
    accepted_data = (data.query(f'Education == "{value}"')['AcceptedTotal'].value_counts()
                     .divide(data['AcceptedTotal'].value_counts())
                     .multiply(100)
                     .reindex(range(6), fill_value=0))  # missing counts -> 0, not NaN
    ax = accepted_data.plot(kind='bar', bottom=bottom, color=PALETTE[i], label=f'{value}')
    bottom += accepted_data  # stack the next group on top of this one
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
plt.title('Number of accepted campaigns proportion by education degree', y=1.05)
plt.xlabel('Number of accepted campaigns')
plt.ylabel('%')
plt.legend(bbox_to_anchor=(1,1))
plt.show()
There are no Basic-degree customers who accepted more than 2 campaigns
plt.figure(figsize=(10, 10))
bottom = pd.Series(0.0, index=range(6))
for i, value in enumerate(data['HasChildren'].unique()):
    accepted_data = (data.query(f'HasChildren == {value}')['AcceptedTotal'].value_counts()
                     .divide(data['AcceptedTotal'].value_counts())
                     .multiply(100)
                     .reindex(range(6), fill_value=0))  # missing counts -> 0, not NaN
    ax = accepted_data.plot(kind='bar', bottom=bottom, color=PALETTE[i], label=f'{value}')
    bottom += accepted_data  # stack the next group on top of this one
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
plt.title('Number of accepted campaigns proportion by having children', y=1.05)
plt.xlabel('Number of accepted campaigns')
plt.ylabel('%')
plt.legend(['Customers without children', 'Customers with one or more children'], bbox_to_anchor=(1,1))
plt.show()
This graph shows that the more campaigns a customer accepted, the more likely they are to be a customer without children
data_cleaned = data.drop(data[(data['Income'] > 140000) |
(data['Age'] > 100) |
(data['AvgCheck'] > 150)].index)
data_cleaned
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | MntTotal | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | NumTotalPurchases | AvgCheck | AcceptedTotal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 5524 | 59 | 1703 | Bachelor | Single | 58138.0 | 0 | 0 | 58 | 635 | 88 | ... | 1617 | 39.270254 | 33.766234 | 5.442177 | 10.636982 | 5.442177 | 5.442177 | 22 | 73.500000 | 1 |
| 2174 | 62 | 857 | Bachelor | Single | 46344.0 | 1 | 1 | 38 | 11 | 1 | ... | 27 | 40.740741 | 22.222222 | 3.703704 | 7.407407 | 3.703704 | 22.222222 | 4 | 6.750000 | 0 |
| 4141 | 51 | 1204 | Bachelor | Together | 71613.0 | 0 | 0 | 26 | 426 | 49 | ... | 776 | 54.896907 | 16.365979 | 6.314433 | 14.304124 | 2.706186 | 5.412371 | 20 | 38.800000 | 0 |
| 6182 | 32 | 797 | Bachelor | Together | 26646.0 | 1 | 0 | 26 | 11 | 4 | ... | 53 | 20.754717 | 37.735849 | 7.547170 | 18.867925 | 5.660377 | 9.433962 | 6 | 8.833333 | 0 |
| 5324 | 35 | 1053 | PhD | Married | 58293.0 | 1 | 0 | 94 | 173 | 43 | ... | 422 | 40.995261 | 27.962085 | 10.189573 | 10.900474 | 6.398104 | 3.554502 | 14 | 30.142857 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8080 | 30 | 1573 | Bachelor | Single | 26816.0 | 0 | 0 | 50 | 5 | 1 | ... | 22 | 22.727273 | 27.272727 | 4.545455 | 13.636364 | 18.181818 | 13.636364 | 3 | 7.333333 | 0 |
| 10870 | 49 | 1273 | Bachelor | Married | 61223.0 | 0 | 1 | 46 | 709 | 43 | ... | 1341 | 52.870992 | 13.571961 | 3.206562 | 3.131991 | 8.799403 | 18.419090 | 16 | 83.812500 | 0 |
| 7270 | 35 | 1047 | Bachelor | Divorced | 56981.0 | 0 | 0 | 91 | 908 | 48 | ... | 1241 | 73.166801 | 17.485898 | 3.867849 | 2.578566 | 0.966962 | 1.933924 | 18 | 68.944444 | 1 |
| 8235 | 60 | 1048 | Master | Together | 69245.0 | 0 | 1 | 8 | 428 | 30 | ... | 843 | 50.771056 | 25.385528 | 3.558719 | 9.489917 | 3.558719 | 7.236062 | 21 | 40.142857 | 0 |
| 9405 | 62 | 1514 | PhD | Married | 52869.0 | 1 | 1 | 40 | 84 | 3 | ... | 172 | 48.837209 | 35.465116 | 1.744186 | 1.162791 | 0.581395 | 12.209302 | 8 | 21.500000 | 1 |
2035 rows × 39 columns
NUMERICAL_FEATURES = ['Age', 'Income', 'NumChildren', 'CustomerFor',
'Recency', 'MntWines', 'MntTotal',
'NumTotalPurchases', 'AcceptedTotal', 'AvgCheck']
sns.pairplot(data=data_cleaned[NUMERICAL_FEATURES],
kind='scatter', plot_kws={'alpha':0.3})
plt.show()
We can see distinct bands of points on some scatterplots: MntTotal vs NumTotalPurchases, for instance, or AvgCheck vs Age
This happens because NumTotalPurchases has a multimodal distribution, and AvgCheck is computed from NumTotalPurchases
corr_matr = data_cleaned[NUMERICAL_FEATURES].corr(method='pearson')
plt.figure(figsize=(10,10))
sns.heatmap(corr_matr, annot=True, cmap='coolwarm', square=True)
plt.title("Pearson's correlation heatmap")
plt.show()
All correlations are clear and explainable
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2052 entries, 5524 to 9405 Data columns (total 39 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 2052 non-null int64 1 CustomerFor 2052 non-null int64 2 Education 2052 non-null object 3 Marital_Status 2052 non-null object 4 Income 2052 non-null float64 5 Kidhome 2052 non-null int64 6 Teenhome 2052 non-null int64 7 Recency 2052 non-null int64 8 MntWines 2052 non-null int64 9 MntFruits 2052 non-null int64 10 MntMeatProducts 2052 non-null int64 11 MntFishProducts 2052 non-null int64 12 MntSweetProducts 2052 non-null int64 13 MntGoldProds 2052 non-null int64 14 NumDealsPurchases 2052 non-null int64 15 NumWebPurchases 2052 non-null int64 16 NumCatalogPurchases 2052 non-null int64 17 NumStorePurchases 2052 non-null int64 18 NumWebVisitsMonth 2052 non-null int64 19 AcceptedCmp3 2052 non-null int64 20 AcceptedCmp4 2052 non-null int64 21 AcceptedCmp5 2052 non-null int64 22 AcceptedCmp1 2052 non-null int64 23 AcceptedCmp2 2052 non-null int64 24 Complain 2052 non-null int64 25 Response 2052 non-null int64 26 HasPartner 2052 non-null object 27 NumChildren 2052 non-null int64 28 HasChildren 2052 non-null int64 29 MntTotal 2052 non-null int64 30 percentWines 2052 non-null float64 31 percentMeat 2052 non-null float64 32 percentFruits 2052 non-null float64 33 percentFish 2052 non-null float64 34 percentSweets 2052 non-null float64 35 percentGold 2052 non-null float64 36 NumTotalPurchases 2052 non-null int64 37 AvgCheck 2052 non-null float64 38 AcceptedTotal 2052 non-null int64 dtypes: float64(8), int64(28), object(3) memory usage: 705.8+ KB
SELECTED_FEATURES = ['AvgCheck', 'Income', 'NumTotalPurchases', 'MntTotal']
scaler = gen_features(
columns = [[c] for c in SELECTED_FEATURES],
classes=[{'class': PowerTransformer, 'method': 'box-cox'}]
)
The features and the scaling method were chosen through an iterative process of evaluating different combinations with the silhouette score
By the way, the $Box\text{-}Cox$ transformation is:
$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \ln(x), & \text{if } \lambda = 0 \end{cases}$$
PowerTransformer automatically selects $\lambda$ by maximum likelihood estimation.
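How PowerTransformer estimates $\lambda$ can be seen on a toy right-skewed sample (a sketch on synthetic data, not the notebook's):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# lognormal data: strictly positive and right-skewed (Box-Cox requires x > 0)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=3.0, sigma=0.8, size=(500, 1))

pt = PowerTransformer(method='box-cox')  # standardize=True by default
x_t = pt.fit_transform(x)

print('estimated lambda:', pt.lambdas_[0])  # near 0, i.e. close to a log transform
print('mean after transform:', x_t.mean())  # ~0 due to standardization
print('std after transform:', x_t.std())    # ~1 due to standardization
```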
The selected features are correlated, but I don't think this is a real problem for k-means clustering (source). Besides, the remaining uncorrelated features are not really interesting to cluster by
scaling_mapper = DataFrameMapper(scaler, default=None, df_out=True)
data_cleaned_scaled = scaling_mapper.fit_transform(data_cleaned)
data_scaled = scaling_mapper.transform(data)
data_cleaned_scaled
| AvgCheck | Income | NumTotalPurchases | MntTotal | Age | CustomerFor | Education | Marital_Status | Kidhome | Teenhome | ... | HasPartner | NumChildren | HasChildren | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | AcceptedTotal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 5524 | 1.217130 | 0.340014 | 1.215416 | 1.321052 | 59 | 1703 | Bachelor | Single | 0 | 0 | ... | No | 0 | 0 | 39.270254 | 33.766234 | 5.442177 | 10.636982 | 5.442177 | 5.442177 | 1 |
| 2174 | -1.529942 | -0.229412 | -1.348257 | -1.452185 | 62 | 857 | Bachelor | Single | 1 | 1 | ... | No | 2 | 1 | 40.740741 | 22.222222 | 3.703704 | 7.407407 | 3.703704 | 22.222222 | 0 |
| 4141 | 0.399683 | 0.968542 | 1.024133 | 0.661824 | 51 | 1204 | Bachelor | Together | 0 | 0 | ... | Yes | 0 | 0 | 54.896907 | 16.365979 | 6.314433 | 14.304124 | 2.706186 | 5.412371 | 0 |
| 6182 | -1.260010 | -1.238166 | -0.879737 | -1.123894 | 32 | 797 | Bachelor | Together | 1 | 0 | ... | Yes | 1 | 1 | 20.754717 | 37.735849 | 7.54717 | 18.867925 | 5.660377 | 9.433962 | 0 |
| 5324 | 0.094285 | 0.347367 | 0.366777 | 0.177483 | 35 | 1053 | PhD | Married | 1 | 0 | ... | Yes | 1 | 1 | 40.995261 | 27.962085 | 10.189573 | 10.900474 | 6.398104 | 3.554502 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8080 | -1.447733 | -1.229040 | -1.639490 | -1.544204 | 30 | 1573 | Bachelor | Single | 0 | 0 | ... | No | 0 | 0 | 22.727273 | 27.272727 | 4.545455 | 13.636364 | 18.181818 | 13.636364 | 0 |
| 10870 | 1.393403 | 0.485787 | 0.602499 | 1.144608 | 49 | 1273 | Bachelor | Married | 0 | 1 | ... | Yes | 1 | 1 | 52.870992 | 13.571961 | 3.206562 | 3.131991 | 8.799403 | 18.41909 | 0 |
| 7270 | 1.132274 | 0.285028 | 0.820591 | 1.073275 | 35 | 1047 | Bachelor | Divorced | 0 | 0 | ... | No | 0 | 0 | 73.166801 | 17.485898 | 3.867849 | 2.578566 | 0.966962 | 1.933924 | 1 |
| 8235 | 0.441585 | 0.859557 | 1.121186 | 0.731884 | 60 | 1048 | Master | Together | 0 | 1 | ... | Yes | 1 | 1 | 50.771056 | 25.385528 | 3.558719 | 9.489917 | 3.558719 | 7.236062 | 0 |
| 9405 | -0.299587 | 0.088141 | -0.500678 | -0.445691 | 62 | 1514 | PhD | Married | 1 | 1 | ... | Yes | 2 | 1 | 48.837209 | 35.465116 | 1.744186 | 1.162791 | 0.581395 | 12.209302 | 1 |
2035 rows × 39 columns
g = sns.pairplot(data=data_cleaned_scaled[SELECTED_FEATURES],
kind='scatter')
g.fig.suptitle('Selected features after scaling')
plt.tight_layout()
Choosing the number of clusters
I will use the elbow rule and silhouette score visualisation to choose the optimal number of clusters (k)
options = range(2,9)
inertias = []
for n_clusters in options:
model = KMeans(n_clusters, random_state=42, init='k-means++').fit(data_cleaned_scaled[SELECTED_FEATURES])
inertias.append(model.inertia_)
plt.figure(figsize=(20,10))
plt.title("The elbow rule visualisation")
plt.plot(options, inertias, '-o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia');
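Reading the elbow off the plot is subjective; it can also be located programmatically, e.g. with a simple second-difference heuristic (a sketch, not part of the original analysis):

```python
import numpy as np

def elbow_by_second_difference(ks, inertias):
    """Pick the k where the inertia curve bends most sharply,
    i.e. where the second difference of inertia is largest."""
    d2 = np.diff(inertias, n=2)          # second differences, length len(ks) - 2
    return ks[int(np.argmax(d2)) + 1]    # +1 re-centers onto the middle point

# synthetic inertia curve with a clear elbow at k = 4
ks = list(range(2, 9))
inertias = [1000, 600, 300, 280, 265, 255, 250]
print(elbow_by_second_difference(ks, inertias))  # → 4
```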
k_range = range(2, 6)
fig, axes = plt.subplots(4, 1, figsize=(10, 18))
for i in k_range:
model = KMeans(i, init='k-means++', n_init=100, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick', ax=axes[i-2])
visualizer.fit(data_cleaned_scaled[SELECTED_FEATURES])
visualizer.finalize()
axes[i-2].set_xlim(-0.1, 1)
plt.tight_layout()
I think 4 is the optimal number of clusters.
According to the elbow plot, 4 or 5 clusters could be optimal.
Looking at the silhouette visualisation, 4 clusters gives relatively high silhouette scores for every cluster.
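The same per-k comparison can be made numerically with scikit-learn's silhouette_score instead of yellowbrick's visualizer (a sketch on synthetic blobs, not the notebook's data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# four clearly separated blobs, so the score should peak at k = 4
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [6, 6], [0, 6], [6, 0]],
                  cluster_std=0.5, random_state=42)

scores = {k: silhouette_score(X, KMeans(n_clusters=k, init='k-means++',
                                        n_init=10, random_state=42).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)  # → 4
```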
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(data_cleaned_scaled[SELECTED_FEATURES])
pred = kmeans.predict(data_cleaned_scaled[SELECTED_FEATURES])
data_cleaned_clustered = data_cleaned.copy()
data_cleaned_clustered['Cluster'] = pred + 1
fig = px.scatter_3d(data_cleaned_clustered, x="Income", y="AvgCheck", z="MntTotal", color='Cluster', width=800, height=800)
fig.show()
PCA visualisation
pca = PCA(n_components=2)
pca_data = pca.fit_transform(data_cleaned_scaled[SELECTED_FEATURES])
pca_df = pd.DataFrame.from_records(data=pca_data, columns=["x1","x2"])
pca_df["Cluster"] = pred + 1
fig = px.scatter(pca_df, x="x1", y="x2", color='Cluster', width=800, height=800)
fig.show()
The model slightly confuses clusters 2 and 3, but overall that is not a big deal
Explained variance ratio of the PCA components:
pca.explained_variance_ratio_
array([0.89877585, 0.06084716])
pca.explained_variance_ratio_.sum()
0.9596230051714931
The first component explains around 90% of the variance, which happens because the selected variables are correlated
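That a single component dominates when features are correlated can be reproduced on a toy example (a sketch, not the notebook's data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 1))
# four features that are noisy copies of one latent variable -> strongly correlated
X = latent + rng.normal(scale=0.1, size=(1000, 4))

ratios = PCA(n_components=2).fit(X).explained_variance_ratio_
print(ratios)  # the first component captures almost all the variance
```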
Now let's predict cluster labels on the full dataset, including the outliers
full_preds = kmeans.predict(data_scaled[SELECTED_FEATURES])
data_clustered = data.copy()
data_clustered['Cluster'] = full_preds + 1
data_clustered
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | NumTotalPurchases | AvgCheck | AcceptedTotal | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 5524 | 59 | 1703 | Bachelor | Single | 58138.0 | 0 | 0 | 58 | 635 | 88 | ... | 39.270254 | 33.766234 | 5.442177 | 10.636982 | 5.442177 | 5.442177 | 22 | 73.500000 | 1 | 3 |
| 2174 | 62 | 857 | Bachelor | Single | 46344.0 | 1 | 1 | 38 | 11 | 1 | ... | 40.740741 | 22.222222 | 3.703704 | 7.407407 | 3.703704 | 22.222222 | 4 | 6.750000 | 0 | 1 |
| 4141 | 51 | 1204 | Bachelor | Together | 71613.0 | 0 | 0 | 26 | 426 | 49 | ... | 54.896907 | 16.365979 | 6.314433 | 14.304124 | 2.706186 | 5.412371 | 20 | 38.800000 | 0 | 2 |
| 6182 | 32 | 797 | Bachelor | Together | 26646.0 | 1 | 0 | 26 | 11 | 4 | ... | 20.754717 | 37.735849 | 7.547170 | 18.867925 | 5.660377 | 9.433962 | 6 | 8.833333 | 0 | 1 |
| 5324 | 35 | 1053 | PhD | Married | 58293.0 | 1 | 0 | 94 | 173 | 43 | ... | 40.995261 | 27.962085 | 10.189573 | 10.900474 | 6.398104 | 3.554502 | 14 | 30.142857 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9432 | 39 | 1400 | Bachelor | Together | 666666.0 | 1 | 0 | 23 | 9 | 14 | ... | 14.516129 | 29.032258 | 22.580645 | 12.903226 | 1.612903 | 19.354839 | 7 | 8.857143 | 0 | 3 |
| 10870 | 49 | 1273 | Bachelor | Married | 61223.0 | 0 | 1 | 46 | 709 | 43 | ... | 52.870992 | 13.571961 | 3.206562 | 3.131991 | 8.799403 | 18.419090 | 16 | 83.812500 | 0 | 3 |
| 7270 | 35 | 1047 | Bachelor | Divorced | 56981.0 | 0 | 0 | 91 | 908 | 48 | ... | 73.166801 | 17.485898 | 3.867849 | 2.578566 | 0.966962 | 1.933924 | 18 | 68.944444 | 1 | 2 |
| 8235 | 60 | 1048 | Master | Together | 69245.0 | 0 | 1 | 8 | 428 | 30 | ... | 50.771056 | 25.385528 | 3.558719 | 9.489917 | 3.558719 | 7.236062 | 21 | 40.142857 | 0 | 2 |
| 9405 | 62 | 1514 | PhD | Married | 52869.0 | 1 | 1 | 40 | 84 | 3 | ... | 48.837209 | 35.465116 | 1.744186 | 1.162791 | 0.581395 | 12.209302 | 8 | 21.500000 | 1 | 4 |
2052 rows × 40 columns
sizes = dict(data_clustered['Cluster'].value_counts())
plt.figure(figsize=(12, 8))
plt.title("Clusters proportions")
plt.pie(sizes.values(), labels=sorted(sizes.keys()), autopct="%.1f%%", pctdistance=0.85, shadow=True, colors=PALETTE)
plt.legend(title="Customer's cluster", labels=sorted(sizes.keys()), bbox_to_anchor=(1, 1))
# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()
Cluster 1 is the biggest, with around 1/3 of all customers. Clusters 2, 3 and 4 are roughly the same size
plt.figure(figsize=(16,5))
plt.title(f'Customers income by cluster')
ax = sns.boxplot(data=data_clustered, x='Cluster', y='Income', palette=PALETTE, showfliers=False)
plt.show()
Cluster 1: low income
Cluster 2: high income
Cluster 3: very high income
Cluster 4: medium income
data_clustered.query('Income > 140000')
| Age | CustomerFor | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | ... | percentWines | percentMeat | percentFruits | percentFish | percentSweets | percentGold | NumTotalPurchases | AvgCheck | AcceptedTotal | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 8475 | 43 | 1069 | PhD | Married | 157243.0 | 0 | 1 | 98 | 20 | 2 | ... | 1.243781 | 98.383085 | 0.124378 | 0.062189 | 0.124378 | 0.062189 | 22 | 73.090909 | 0 | 3 |
| 1503 | 40 | 1372 | PhD | Together | 162397.0 | 1 | 1 | 31 | 85 | 1 | ... | 79.439252 | 14.953271 | 0.934579 | 1.869159 | 0.934579 | 1.869159 | 1 | 107.000000 | 0 | 3 |
| 1501 | 34 | 1704 | PhD | Married | 160803.0 | 0 | 0 | 21 | 55 | 16 | ... | 3.203262 | 94.467094 | 0.931858 | 0.990099 | 0.174723 | 0.232964 | 29 | 59.206897 | 0 | 3 |
| 5336 | 45 | 1341 | Master | Together | 157733.0 | 1 | 0 | 37 | 39 | 1 | ... | 66.101695 | 15.254237 | 1.694915 | 3.389831 | 0.000000 | 13.559322 | 2 | 29.500000 | 0 | 3 |
| 4931 | 39 | 1318 | Bachelor | Together | 157146.0 | 0 | 0 | 13 | 1 | 0 | ... | 0.057803 | 99.710983 | 0.000000 | 0.115607 | 0.057803 | 0.057803 | 28 | 61.785714 | 0 | 3 |
| 9432 | 39 | 1400 | Bachelor | Together | 666666.0 | 1 | 0 | 23 | 9 | 14 | ... | 14.516129 | 29.032258 | 22.580645 | 12.903226 | 1.612903 | 19.354839 | 7 | 8.857143 | 0 | 3 |
6 rows × 40 columns
All the income outliers ended up in cluster 3
plt.figure(figsize=(16,5))
plt.title('Customers average check by cluster')
ax = sns.boxplot(data=data_clustered, x='Cluster', y='AvgCheck', palette=PALETTE, showfliers=False)
plt.show()
The average check follows the clusters' income levels, but the gap between cluster 3 and the others is huge
plt.figure(figsize=(16,5))
plt.title('Number of total purchases by cluster')
ax = sns.boxplot(data=data_clustered, x='Cluster', y='NumTotalPurchases', palette=PALETTE, showfliers=False)
plt.show()
Clusters 2 and 3 are the most active and frequent buyers, cluster 4 has a medium purchase frequency, and cluster 1 a low one
plt.figure(figsize=(16,5))
plt.title(f'Countplot of education degrees by clusters')
sns.countplot(data=data_clustered, x='Education', hue='Cluster', palette=PALETTE)
plt.show()
The Basic degree appears mostly in cluster 1
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
k = 0
for i in range(0, 2):
for j in range(0, 2):
k += 1
sizes = dict(data_clustered.query(f'Cluster == {k}')['HasChildren'].value_counts().sort_index(ascending=False))
axes[i, j].set_title(f"Cluster {k}")
axes[i, j].pie(sizes.values(), labels=['Yes', 'No'], autopct="%.1f%%", pctdistance=0.75, shadow=True, colors=PALETTE)
fig.suptitle('Having children in different clusters')
fig.legend(title="Does the customer have children", labels=['Yes', 'No'], bbox_to_anchor=(1, 1))
fig.show()
Clusters 1, 2 and 4 consist mostly of parents, while customers in cluster 3 mostly have no children
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
k = 0
for i in range(0, 2):
for j in range(0, 2):
k += 1
num = data_clustered.query(f'Cluster == {k}').filter(regex='Num(?!Deals|Total).+Purchases').sum(axis=0)  # negative lookahead keeps only Web/Catalog/Store purchases
sizes = dict(num)
axes[i, j].set_title(f"Cluster {k}")
axes[i, j].pie(sizes.values(), labels=['Website', 'Catalog', 'Store'], autopct="%.1f%%", pctdistance=0.75, shadow=True, colors=PALETTE)
fig.suptitle('Shopping types proportions for each cluster')
fig.legend(title="Shopping type", labels=['Website', 'Catalog', 'Store'], bbox_to_anchor=(1, 1))
fig.show()
Customers from clusters 2 and 3 buy from the catalog more than those from clusters 1 and 4. Perhaps catalog products are new and fairly expensive
Now let's look at the number of website visits by cluster
plt.figure(figsize=(16,5))
plt.title(f'Number of web visits per month by clusters')
ax = sns.boxplot(data=data_clustered, x='Cluster', y='NumWebVisitsMonth', palette=PALETTE, showfliers=False)
plt.show()
Clusters 1 and 4 visit the website the most
fig, axes = plt.subplots(4, 6, figsize=(16, 15), sharey=True)
fig.suptitle('Percent of amount spent on each product type from total amount spent, by cluster', fontsize=20)
for i, value in enumerate(sorted(data_clustered['Cluster'].unique())):
sns.boxplot(data=data_clustered.query(f'Cluster == {value}'), y='percentWines', showfliers=False, color=PALETTE[i], ax=axes[i, 0])
axes[i, 0].set_ylim(0, 100)
axes[i, 0].set_xlabel('Wine')
axes[i, 0].set_ylabel('')
sns.boxplot(data=data_clustered.query(f'Cluster == {value}'), y='percentMeat', showfliers=False, color=PALETTE[i], ax=axes[i, 1])
axes[i, 1].set_xlabel('Meat')
axes[i, 1].set_ylabel('')
sns.boxplot(data=data_clustered.query(f'Cluster == {value}'), y='percentFruits', showfliers=False, color=PALETTE[i], ax=axes[i, 2])
axes[i, 2].set_xlabel('Fruits')
axes[i, 2].set_ylabel('')
axes[i, 2].set_title(f'Cluster {value}', x=1)
sns.boxplot(data=data_clustered.query(f'Cluster == {value}'), y='percentFish', showfliers=False, color=PALETTE[i], ax=axes[i, 3])
axes[i, 3].set_xlabel('Fish')
axes[i, 3].set_ylabel('')
sns.boxplot(data=data_clustered.query(f'Cluster == {value}'), y='percentSweets', showfliers=False, color=PALETTE[i], ax=axes[i, 4])
axes[i, 4].set_xlabel('Sweets')
axes[i, 4].set_ylabel('')
sns.boxplot(data=data_clustered.query(f'Cluster == {value}'), y='percentGold', showfliers=False, color=PALETTE[i], ax=axes[i, 5])
axes[i, 5].set_xlabel('Gold')
axes[i, 5].set_ylabel('')
plt.tight_layout()
As we can see, the popular product types are the same in all clusters: wine and meat. But cluster 3 buys meat more than the others, and cluster 1 buys gold products in addition to wine and meat
fig, axes = plt.subplots(4, 1, figsize=(16, 20))
for i in range(1, 5):
ax = (data_clustered.query(f'Cluster == {i}')[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']]
.sum()
.divide(data_clustered.query(f'Cluster == {i}').shape[0]).multiply(100)
.plot(kind='bar', figsize=(18,15), title=f'% of customers from Cluster {i} accepted different campaigns', ax=axes[i-1], color=PALETTE[i-1]))
ax.set_xticklabels(['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 4', 'Campaign 5', 'Last campaign'], rotation=0)
plt.tight_layout()
We see that:
complains_by_cluster = (data_clustered.groupby(by='Cluster')['Complain'].sum()
.divide(data_clustered['Cluster'].value_counts())
.multiply(100))
ax = complains_by_cluster.plot(kind='bar', figsize=(18, 8), color=PALETTE[:4],
title='Percent of complained customers for the last 2 years in different clusters',
ylabel='%', xlabel='Cluster')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
plt.show()
Cluster 1 purchases less but complains more, which is interesting
I would rate the clusters as follows:
Platinum customers:
Gold customers:
Silver customers:
Bronze customers:
Acquired skills: